
feat(generators): add --deterministic flag with hybrid RDFC-1.0 + rdflib serialization #1

Open
jdsika wants to merge 1 commit into main from feat/deterministic-output

@jdsika jdsika commented Mar 25, 2026

Summary

Add a --deterministic flag to OWL, SHACL, and JSON-LD generators that produces byte-identical output across invocations, eliminating spurious diffs in version-controlled artifacts.

This is a review-ready fork of the approach discussed in upstream linkml/linkml#3295, rebuilt to address maintainer feedback.

Problem

Generated OWL and SHACL artifacts contain blank nodes whose identifiers change between runs due to Python dict ordering and rdflib serialization non-determinism. This makes version-controlled artifacts show massive diffs even when the underlying schema change is trivial.

Solution

Three-Phase Hybrid Pipeline (deterministic_turtle())

  1. RDFC-1.0 canonicalization (W3C Recommendation) via pyoxigraph ensures isomorphic inputs produce identical triple sets.
  2. Weisfeiler-Lehman structural hashing replaces sequential _:c14nN identifiers with content-based hashes. These depend only on predicate IRIs, literal values, and named-node IRIs — not on blank-node numbering — so adding or removing a triple only affects directly involved blank nodes.
  3. Hybrid rdflib re-serialization parses the canonicalized, WL-hashed triples back into an rdflib Graph and serializes with rdflib's native Turtle writer. This recovers idiomatic Turtle features that pyoxigraph cannot emit:
    • Inline blank nodes ([ … ]) for singly-referenced blank nodes (Turtle §2.7)
    • Collection syntax (( … )) for rdf:List chains (Turtle §2.8)
    • Prefix filtering: only prefixes actually used in the graph's IRIs are declared, following the practice of Apache Jena, Eclipse RDF4J, and Raptor

All triples from the source graph are preserved — the hybrid step only changes syntactic form, never semantic content. Plain string literals have their xsd:string datatype stripped per RDF 1.1 §2.5.1 (simple literals are syntactic sugar for xsd:string).
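
As a rough illustration of phase 2, the sketch below relabels blank nodes by iterated neighbourhood hashing over plain string triples. It is not the PR's `_wl_signatures()`: the tuple representation, the `_:` prefix test, the name `wl_labels`, and the default of four rounds are all assumptions made for this example.

```python
import hashlib


def _colour(term, sig):
    # Blank nodes contribute their current signature; other terms contribute themselves.
    return ("B", sig[term]) if term.startswith("_:") else ("N", term)


def wl_labels(triples, rounds=4):
    """Map each blank node to a content-based label (illustrative only)."""
    bnodes = {t for s, _, o in triples for t in (s, o) if t.startswith("_:")}
    sig = {b: "" for b in bnodes}  # every blank node starts with the same colour
    for _ in range(rounds):
        new_sig = {}
        for b in bnodes:
            parts = []
            for s, p, o in triples:
                if s == b:
                    parts.append(("out", p, _colour(o, sig)))
                if o == b:
                    parts.append(("in", p, _colour(s, sig)))
            # Sorting makes the signature independent of triple order.
            payload = repr(sorted(parts)).encode("utf-8")
            new_sig[b] = hashlib.sha256(payload).hexdigest()
        sig = new_sig
    return {b: "b" + sig[b][:12] for b in bnodes}


# Same structure, different canonical numbering.
run_a = [("ex:A", "rdfs:subClassOf", "_:c14n0"), ("_:c14n0", "owl:onProperty", "ex:p")]
run_b = [("ex:A", "rdfs:subClassOf", "_:c14n7"), ("_:c14n7", "owl:onProperty", "ex:p")]
same_label = wl_labels(run_a)["_:c14n0"] == wl_labels(run_b)["_:c14n7"]
```

Because the label is derived only from predicates, named terms, and neighbouring signatures, `same_label` holds even though the input numbering differs.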

Additional Features

Collection sorting (gated behind --deterministic):

  • owl:oneOf, sh:in, sh:ignoredProperties items are sorted when the flag is set
  • Preserves existing behaviour by default

deterministic_json():

  • Recursive deep-sort for JSON-LD context output
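
A minimal sketch of the key-sorting idea, assuming plain `dict`/`list` input. The names `deep_sort_keys` and `stable_json` are hypothetical, and list handling is deliberately left out here, so this is a simplification rather than the actual `deterministic_json()`.

```python
import json


def deep_sort_keys(value):
    """Return a copy of value with every dict's keys in sorted order."""
    if isinstance(value, dict):
        return {k: deep_sort_keys(value[k]) for k in sorted(value)}
    if isinstance(value, list):
        return [deep_sort_keys(item) for item in value]  # list order untouched
    return value


def stable_json(obj) -> str:
    """Serialize with a fixed key order, indentation, and trailing newline."""
    return json.dumps(deep_sort_keys(obj), indent=2, ensure_ascii=False) + "\n"


first = stable_json({"b": 1, "a": {"d": 2, "c": 3}})
second = stable_json({"a": {"c": 3, "d": 2}, "b": 1})
```

Here `first` and `second` are the same string even though the input dicts were built in different orders.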

Benchmark Results

Tested on the Gaia-X Trust Framework ontology (~68K OWL / ~165K SHACL triples) and schema.org (~18K triples):

Semantic Equivalence

| Artifact | Triples (det) | Triples (non-det) | rdflib.compare.isomorphic() |
| --- | --- | --- | --- |
| OWL | 68,178 | 68,178 | True |
| SHACL | 165,029 | 165,029 | True |
| schema.org | 17,949 | 17,949 | True |

Byte-Level Stability

| Test | Deterministic | Non-Deterministic |
| --- | --- | --- |
| SHA-256 identical across runs | ✅ | ❌ (~1,400 lines differ) |

Diff Quality (Signal-to-Noise Ratio)

Controlled mutations on a LinkML schema:

| Mutation | Generator | Deterministic | Non-Deterministic | Noise Reduction |
| --- | --- | --- | --- | --- |
| Change 1 description | OWL | 1 line | 344 lines | 344× |
| Change 1 description | SHACL | 13 lines | 290 lines | 22× |
| Add 1 new slot (+18 triples) | OWL | 16 lines | 350 lines | 22× |
| Add 1 new slot (+7 triples) | SHACL | 23 lines | 305 lines | 13× |

Output Size (Gaia-X Trust Framework)

| Artifact | Before (pyoxigraph-only) | After (hybrid) | Change |
| --- | --- | --- | --- |
| OWL | 58,397 lines, 34 prefixes | 58,291 lines, 11 prefixes | -0.2% |
| SHACL | 163,993 lines, ~34 prefixes | 9,118 lines, ~9 prefixes | -94% |

The SHACL 18× size reduction comes from replacing 157,552 named _:bHASH blank nodes with inline [ … ] syntax and 77,358 explicit rdf:first/rdf:rest triples with ( … ) collection shorthand — matching the upstream Gaia-X registry convention.

Performance

| Graph | Triples | Time | Throughput |
| --- | --- | --- | --- |
| schema.org | 17,949 | 1.5s | ~12,000 triples/s |
| Gaia-X OWL | 68,178 | ~5s | ~14,000 triples/s |

Dependency

pyoxigraph >= 0.4.0 is imported lazily only when --deterministic is used. It is not a core dependency, avoiding conflict with morph-kgc's pin on pyoxigraph < 0.4.0. Tests skip gracefully when pyoxigraph >= 0.4.0 is unavailable.
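
The lazy-import pattern can be sketched roughly as follows. The helper name `require_optional` and the wording of the error message are hypothetical, and the install hint assumes the pip package name matches the module name.

```python
import importlib


def require_optional(module_name: str, feature: str):
    """Import module_name on first use, or fail with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{feature} requires the optional package '{module_name}'. "
            f"Install it with: pip install {module_name}"
        ) from exc
```

A generator would call something like `require_optional("pyoxigraph", "--deterministic")` only inside the deterministic code path, so users who never pass the flag never trigger the import.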

Relationship to upstream linkml#3295

The original PR was closed after maintainer feedback requesting an established canonicalization standard. This PR:

  1. Replaces custom WL canonicalization with W3C RDFC-1.0 via pyoxigraph
  2. Retains WL only for blank node ID assignment (post-canonicalization remapping for diff stability — not a canonicalization algorithm)
  3. Adds hybrid rdflib re-serialization for idiomatic Turtle output (inline blank nodes, collection syntax, prefix filtering)
  4. Makes pyoxigraph an optional lazy import

Testing

  • 37 tests covering idempotency, isomorphism, performance, diff quality, and edge cases
    • test_deterministic_output.py: 27 tests (stability, sorting, prefix format, enum ordering, kitchen_sink)
    • test_deterministic_benchmark.py: 10 local + 4 network tests (schema.org equivalence, mutation diff quality, signal-to-noise assertions)
  • Tests skip cleanly when pyoxigraph >= 0.4.0 is not available
  • All existing tests pass (9500+ across the full matrix)

Benchmark Test Assertions

The benchmark enforces quantitative properties:

  • Deterministic diff ≤ 20 lines for a single description change
  • Signal-to-noise ratio ≥ 5× (actual: 13–344×)
  • Diff proportional to new triples (≤ 5× margin)
  • SHA-256 byte-identical across 3 consecutive runs
  • Every declared prefix has at least one IRI using it
  • schema.org (17,949 triples) serializes in < 60s
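
For orientation, the diff-size and byte-identity checks could be expressed along these lines. This is a sketch, not the benchmark's code: the helper names are hypothetical, and counting every added or removed line from `difflib.SequenceMatcher` opcodes is an assumption about how "diff lines" are measured.

```python
import difflib
import hashlib


def changed_lines(before: str, after: str) -> int:
    """Count lines removed from `before` plus lines added in `after`."""
    a, b = before.splitlines(), after.splitlines()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return sum(
        (i2 - i1) + (j2 - j1)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    )


def byte_identical(outputs) -> bool:
    """True if every serialized output has the same SHA-256 digest."""
    digests = {hashlib.sha256(o.encode("utf-8")).hexdigest() for o in outputs}
    return len(digests) == 1


def noise_reduction(deterministic_diff: int, nondeterministic_diff: int) -> float:
    """Ratio of non-deterministic to deterministic diff size."""
    return nondeterministic_diff / max(deterministic_diff, 1)
```

With the figures from the mutation table, `noise_reduction(13, 290)` is roughly 22, which clears the 5× threshold.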


@jdsika jdsika force-pushed the feat/deterministic-output branch from bdb0f7a to 6544b72 Compare March 25, 2026 16:03
@jdsika jdsika self-assigned this Mar 25, 2026
@jdsika jdsika force-pushed the feat/deterministic-output branch 7 times, most recently from f0a081a to 7f529d6 Compare March 28, 2026 14:23
@jdsika jdsika force-pushed the feat/deterministic-output branch from 7f529d6 to 37cafc8 Compare March 29, 2026 19:34
@jdsika jdsika changed the title from "feat(generators): add --deterministic flag for reproducible output (pyoxigraph RDFC-1.0)" to "feat(generators): add --deterministic flag with hybrid RDFC-1.0 + rdflib serialization" Mar 30, 2026
@jdsika jdsika force-pushed the feat/deterministic-output branch 3 times, most recently from fb47790 to 8016a4b Compare April 2, 2026 09:34

jdsika commented Apr 2, 2026

🔍 Adversarial Review — PR #1

Summary

A well-engineered feature with strong documentation and benchmark data. The three-phase pipeline (RDFC-1.0 → WL → rdflib) is architecturally sound but introduces significant complexity. I found 2 bugs (dead code shipped as functional features), 1 algorithmic concern in collision handling, and several design/test gaps worth addressing before merge.


🐛 Bugs & Issues

1. Dead code: normalize_prefixes field and CLI option do nothing

The normalize_prefixes: bool = False field is added to Generator (line ~429) and a --normalize-prefixes/--no-normalize-prefixes CLI option is registered — but self.normalize_prefixes is never read in any generator code in this PR. Users who pass --normalize-prefixes get silent no-op behavior. This should either be removed from this PR (it belongs in PR #4) or wired up.

# Added to Generator dataclass but never checked anywhere:
normalize_prefixes: bool = False

2. Dead code: well_known_prefix_map() defined but never called

The function is defined in generator.py but is not imported or called anywhere in this PR. It returns {str(ns): str(pfx) for pfx, ns in Graph().namespaces() if str(pfx)}, a dynamic map that changes across rdflib versions (see Concern #1). PR #4 redefines this with a static frozen map, creating a direct merge conflict.

3. WL collision counter assignment depends on c14n ordering, not structure

In _wl_signatures(), when two structurally identical blank nodes produce the same WL hash, the counter suffix (_0 vs _1) is assigned by iterating sorted(bnode_ids) — which sorts by canonical c14n ID (e.g., "c14n0" < "c14n42"):

for bid in sorted(bnode_ids):  # sorted by c14n ID, NOT by structure
    digest = hashlib.sha256(sig[bid].encode("utf-8")).hexdigest()[:12]
    count = seen_hashes.get(digest, 0)
    seen_hashes[digest] = count + 1
    label = f"b{digest}" if count == 0 else f"b{digest}_{count}"

Adding an unrelated triple can change RDFC-1.0 numbering, which changes which colliding node gets the base label vs _1 suffix. This defeats WL's core promise of diff-stable IDs in the (admittedly rare) collision case. Fix: use the full WL signature string as a secondary sort key before assigning counters:

for bid in sorted(bnode_ids, key=lambda b: (sig[b], b)):
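
To make the effect of the secondary sort key concrete, here is a self-contained sketch of the label assignment. The function `assign_labels` and the injectable `hasher` parameter are hypothetical; the parameter exists only so a deliberately weak hash can force a digest collision in the example.

```python
import hashlib


def sha12(text: str) -> str:
    """First 12 hex characters of the SHA-256 digest, as in the snippet above."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def assign_labels(sig, hasher=sha12):
    """Assign b<digest>[_N] labels, ordering colliding nodes by signature first."""
    seen = {}
    labels = {}
    for bid in sorted(sig, key=lambda b: (sig[b], b)):
        digest = hasher(sig[bid])
        count = seen.get(digest, 0)
        seen[digest] = count + 1
        labels[bid] = f"b{digest}" if count == 0 else f"b{digest}_{count}"
    return labels


def weak_hash(_text: str) -> str:
    return "000000000000"  # every signature collides, for demonstration only


# The same two signatures arrive under swapped canonical IDs.
run1 = assign_labels({"c14n0": "sigA", "c14n1": "sigB"}, weak_hash)
run2 = assign_labels({"c14n0": "sigB", "c14n1": "sigA"}, weak_hash)
```

In both runs the node with signature `sigA` receives the base label and `sigB` receives the `_1` suffix, regardless of numbering. Nodes whose signatures are exactly equal still fall back to the ID as tie-breaker, so this narrows the instability rather than removing it in every case.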

⚠️ Concerns

1. well_known_prefix_map() is rdflib-version-dependent (direct conflict with PR #4)

Graph().namespaces() returns different prefix sets across rdflib 6.x vs 7.x (e.g., brick, csvw, geo were added/changed). This means "deterministic" output can change on dependency upgrade. PR #4 fixes this with _WELL_KNOWN_PREFIX_MAP: MappingProxyType — a static frozen map. Both PRs also add normalize_prefixes to Generator. These are two direct merge conflicts.

Recommendation: Remove well_known_prefix_map() and normalize_prefixes from this PR entirely. They're unused dead code here and belong exclusively in PR #4.

2. Sorting OWL expression members by repr is fragile

members = sorted(members, key=repr)

repr() for LinkML model objects depends on dataclass field ordering and internal representation. While stable within a single Python version, it can change across:

  • Python versions (if dataclass repr format changes)
  • linkml-runtime versions (if fields are reordered or types change)
  • Any field containing objects with id()-based repr

A more robust approach: define an explicit sort key using stable, semantic fields (e.g., key=lambda x: (x.range or "", x.minimum_value or "")) or serialize to a canonical string.
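
One caveat with the inline lambda: if `minimum_value` is numeric, falling back to `""` mixes strings and numbers and can raise `TypeError` during comparison. A sketch of a type-safe key follows, using a hypothetical `Member` dataclass as a stand-in for the real LinkML expression objects.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Member:  # hypothetical stand-in, not a linkml-runtime class
    range: Optional[str] = None
    minimum_value: Optional[Union[int, float]] = None


def member_key(m: Member):
    """Sort by range, then unset minimums first, then by numeric minimum."""
    has_min = m.minimum_value is not None
    return (m.range or "", has_min, m.minimum_value if has_min else 0)


members = [Member("integer", 5), Member("string"), Member("integer")]
ordered = sorted(members, key=member_key)
```

Every tuple position holds a single type (string, bool, number), so the ordering does not depend on `repr()` output or field layout.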

3. _mutate_kitchen_sink writes temp files to the test input directory

out_path = ks_path.parent / f"_benchmark_mutated_{os.getpid()}_kitchen_sink.yaml"

This writes to tests/linkml/test_generators/input/, which may be read-only in containerized CI. Also, yaml.dump(yaml.safe_load(...)) reformats the entire schema (quoting styles, flow/block, comment stripping), so the benchmark diffs measure YAML reformatting noise in addition to the intended mutation. Consider using the tmp_path fixture or string manipulation instead of a full YAML round-trip.
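
A sketch of the string-manipulation alternative, assuming the mutation can be expressed as a single unique text replacement. The helper name and the `mutated_` file prefix are hypothetical.

```python
import tempfile
from pathlib import Path


def mutate_schema_text(source: Path, old: str, new: str, out_dir: Path) -> Path:
    """Write a copy of source with one targeted text replacement into out_dir."""
    text = source.read_text(encoding="utf-8")
    if text.count(old) != 1:
        raise ValueError(f"expected exactly one occurrence of {old!r}")
    out_path = out_dir / f"mutated_{source.name}"
    out_path.write_text(text.replace(old, new), encoding="utf-8")
    return out_path


# Standalone demonstration; in a pytest test, out_dir would be the tmp_path fixture.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "schema.yaml"
    src.write_text("name: demo\ndescription: old text\n", encoding="utf-8")
    mutated = mutate_schema_text(src, "old text", "new text", Path(tmp))
    mutated_text = mutated.read_text(encoding="utf-8")
```

Only the targeted line changes, so quoting, comments, and layout in the rest of the schema stay byte-for-byte intact. Schemas with relative imports would still need those imports copied or linked next to the mutated file.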

4. No RDFC-1.0 timeout for pathological graphs

The W3C RDFC-1.0 spec acknowledges exponential worst-case complexity for certain graph topologies. pyoxigraph.Dataset.canonicalize() has no timeout parameter exposed. A signal.alarm() or thread-based timeout would protect CI from hangs on adversarial input.
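
A thread-based guard might look like the sketch below. The wrapper `run_with_timeout` is hypothetical, and it only helps if the wrapped native call releases the GIL, which is an assumption to verify for pyoxigraph. The worker thread is abandoned rather than killed, so a subprocess would be the stricter option.

```python
import threading


def run_with_timeout(func, timeout_s: float):
    """Run func() in a daemon thread; raise TimeoutError if it overruns."""
    outcome = {}

    def worker():
        try:
            outcome["value"] = func()
        except Exception as exc:  # re-raised in the caller below
            outcome["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)
    if thread.is_alive():
        raise TimeoutError(f"call did not finish within {timeout_s}s")
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]
```

Unlike `signal.alarm()`, this does not depend on Unix signals or on running in the main thread.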

5. _deep_sort doesn't propagate parent_key into list items

sorted_items = [_deep_sort(item) for item in value]  # parent_key defaults to ""

If a list item is itself a list, it will be sorted regardless of whether the grandparent key is in _JSONLD_ORDERED_KEYS. Not a practical JSON-LD issue today, but a latent correctness gap.
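
A sketch of the propagating variant. The constant `ORDERED_KEYS` is a hypothetical stand-in for `_JSONLD_ORDERED_KEYS`, and sorting list items by their JSON text is an assumption about how the real function orders them.

```python
import json

ORDERED_KEYS = frozenset({"@list"})  # hypothetical stand-in for _JSONLD_ORDERED_KEYS


def deep_sort(value, parent_key=""):
    """Sort dict keys and list items, except under order-sensitive keys."""
    if isinstance(value, dict):
        return {k: deep_sort(value[k], k) for k in sorted(value)}
    if isinstance(value, list):
        # Pass parent_key down so nested lists inherit the ordering rule.
        items = [deep_sort(item, parent_key) for item in value]
        if parent_key in ORDERED_KEYS:
            return items
        return sorted(items, key=lambda x: json.dumps(x, sort_keys=True))
    return value


kept = deep_sort({"@list": [[3, 1], [2, 0]]})
sorted_everywhere = deep_sort({"other": [[3, 1], [2, 0]]})
```

With propagation, the nested lists under `@list` keep their original order, while the same data under an ordinary key is sorted at both levels.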

6. O(n×m) prefix filtering

if pfx_s and any(iri.startswith(ns_s) for iri in used_iris):

With ~30 prefixes × thousands of IRIs, this costs up to one startswith check per (prefix, IRI) pair, i.e. O(n×m) rather than strictly quadratic. A trie or pre-sorted binary search would scale better for large ontologies (though the current 68K-triple benchmarks likely absorb this).
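
The binary-search variant can be sketched as follows. The function name and the dict-based namespace input are assumptions for illustration. After one sort of the IRIs, each namespace needs a single `bisect` lookup, because all strings sharing a prefix form a contiguous run that starts at the insertion point of that prefix.

```python
import bisect


def used_prefixes(namespaces, used_iris):
    """Return the prefixes whose namespace is a prefix of at least one IRI."""
    ordered = sorted(used_iris)
    result = {}
    for pfx, ns in namespaces.items():
        if not pfx:
            continue  # mirror the original check that skips the empty prefix
        i = bisect.bisect_left(ordered, ns)
        if i < len(ordered) and ordered[i].startswith(ns):
            result[pfx] = ns
    return result


namespaces = {
    "ex": "http://example.org/",
    "owl": "http://www.w3.org/2002/07/owl#",
    "": "http://default.example/",
}
iris = ["http://example.org/A", "http://example.com/x"]
kept_prefixes = used_prefixes(namespaces, iris)
```

This brings the cost to roughly O(m log m + n log m) for m IRIs and n prefixes.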

7. xsd:string stripping is correct but lossy

The code strips explicit xsd:string datatype annotations during pyoxigraph→rdflib conversion. This is correct per RDF 1.1 §2.5.1 (simple literals = xsd:string), but users who intentionally annotated xsd:string for tooling compatibility lose the annotation after round-trip. Worth documenting this in the docstring.


🧪 Test Coverage Assessment

Strong coverage (✅):

  • Byte-level idempotency across runs (4 generators)
  • Sorted key verification for JSON generators
  • Prefix format (@prefix vs PREFIX)
  • Large schema stability (kitchen_sink)
  • xfail documentation of non-isomorphism trade-off
  • Non-deterministic mode regression guard

Gaps (❌):

  • No test for WL hash collisions: Need a graph with two structurally identical blank node subgraphs to verify counter-based dedup produces stable output
  • No test for deep blank node nesting (>4 levels): 4 WL iterations may be insufficient for chains of 5+ structurally similar nodes
  • No test for well_known_prefix_map(): Dead code, but if kept, needs coverage
  • No test for normalize_prefixes: Dead code — the CLI option silently does nothing
  • xfail tests document non-isomorphism but don't verify semantic equivalence: The assertion "OWL/SHACL interpret these as unordered sets" should be backed by a test that verifies the same classes/constraints exist in both modes (e.g., compare extracted sh:in value sets, not just triple counts)
  • Benchmark YAML round-trip may inflate diff counts: yaml.dump(yaml.safe_load(original)) reformats the entire file — diff measurements may include YAML formatting noise, not just the intended mutation
  • Network tests (@pytest.mark.network): schema.org download can fail in CI. No conftest.py marker filtering shown — these may run and flake by default
  • Performance test (10s limit): Fragile on slow CI runners; consider a more generous threshold or @pytest.mark.slow marker

📋 Fix Plan

  1. Remove normalize_prefixes and well_known_prefix_map() from this PR: they are dead code, belong in PR #4 (fix(generators): add --normalize-prefixes flag for well-known prefix names), and create merge conflicts
  2. Fix WL collision counter ordering: Sort by (sig[bid], bid) instead of just bid to make counter assignment structure-dependent
  3. Add a WL collision test: Create a graph with two identical blank node subgraphs, verify both get stable IDs across runs
  4. Replace repr sorting with explicit key function for OWL expression members
  5. Use tmp_path fixture for _mutate_kitchen_sink output (create a symlink or copy imports alongside)
  6. Add @pytest.mark.slow to performance tests and document CI threshold expectations
  7. Document xsd:string stripping in deterministic_turtle() docstring explicitly as a known behavior

✅ What's Good

  • Excellent documentation: Thorough docstrings with W3C spec references, clear parameter docs, and well-written PR description with benchmark tables
  • Lazy import pattern: pyoxigraph is only loaded when --deterministic is used; graceful ImportError with actionable message
  • Defensive test design: pytestmark = pytest.mark.skipif(not _has_pyoxigraph, ...) and fixture-level pytest.skip() for network tests
  • The hybrid pipeline is clever: RDFC-1.0 for correctness, WL for diff stability, rdflib for Turtle readability — each phase has a clear purpose
  • The 94% SHACL size reduction (inline blank nodes + collection syntax) is a compelling result
  • xfail tests properly document trade-offs rather than hiding them
  • Prefix filtering removes the ~27 rdflib default bindings that leak into output — real quality-of-life improvement

…lib serialization

Add a --deterministic / --no-deterministic CLI flag (default off) to OWL,
SHACL, JSON-LD Context, and JSON-LD generators that produces byte-identical
output across invocations.

Three-phase hybrid pipeline for Turtle generators:
1. RDFC-1.0 canonicalization (W3C Recommendation) via pyoxigraph
2. Weisfeiler-Lehman structural hashing for diff-stable blank node IDs
3. Hybrid rdflib re-serialization for idiomatic Turtle (inline blank
   nodes, collection syntax, prefix filtering)

JSON generators use deterministic_json() with recursive deep-sort and
JSON-LD-aware key ordering that preserves conventional @context structure.

Collection items (owl:oneOf, sh:in, sh:ignoredProperties) are sorted
when --deterministic is set to ensure reproducible RDF list order.

pyoxigraph >= 0.4.0 is imported lazily only when --deterministic is used.
Tests skip gracefully when pyoxigraph is unavailable.

Refs: linkml#1847
Signed-off-by: Carlo van Driesten <carlo.van-driesten@bmw.de>
Signed-off-by: jdsika <carlo.van-driesten@bmw.de>
@jdsika jdsika force-pushed the feat/deterministic-output branch from cfaba19 to c4ecf10 Compare April 2, 2026 15:18